ABSTRACT

Salient object detection has become an important task in many
image processing applications. Existing approaches exploit the
background prior and the contrast prior to attain state-of-the-art
results. In this paper, instead of using background cues, we estimate
the foreground regions in an image using objectness proposals and
utilize them to obtain smooth and accurate saliency maps. We propose
a novel saliency measure called ‘foreground connectivity’, which
determines how tightly a pixel or a region is connected to the
estimated foreground. We use the values assigned by this measure as
foreground weights and integrate them in an optimization framework
to obtain the final saliency maps. We extensively evaluate the
proposed approach on two benchmark databases and demonstrate
that the results obtained are better than those of existing
state-of-the-art approaches.

Index Terms — Image Saliency, Objectness Proposals, Image 
Segmentation, Superpixels 

1. INTRODUCTION 

The human visual system has the ability to process the parts of an
image that are relevant while discarding the rest. This helps us to
perceive objects even before identifying them. Saliency detection,
i.e., computationally detecting these relevant regions, is a complex
problem that takes cues from models in cognitive psychology,
neurobiology and computer vision. It has gained a lot of attention in
recent years from the computer vision community owing to its use in
object recognition [1], object segmentation [2], image re-targeting [3]
and cropping [4], image retrieval [5], etc.

Works in saliency detection are classified into three categories:
fixation prediction, salient object detection and objectness proposal
generation.

Early models were biologically inspired and were evaluated on
human eye fixation datasets. Ullman and Koch [6] define saliency
at a given location as how different it is from its surroundings in
color, orientation, motion, depth, etc. Itti et al. [7] follow the same
framework and propose a centre-surround contrast using a Difference
of Gaussians to obtain their saliency maps. Ma and Zhang [8] use
similar contrast analysis and extend it using a fuzzy growth model.
These models highlight edges and corners and are not suitable for
detecting complete salient regions.

Salient object detection models aim to segment the object as a
whole and are evaluated mostly on data labeled by humans, such as
bounding boxes or foreground masks. These methods use low-level
cues such as the contrast prior and the boundary prior.
Methods using the contrast prior rely on the uniqueness of the object
and on contrast between pixels or regions, center-surround
differences, etc. Methods based on the contrast prior may be further
classified into global or local methods. Local methods compute a
contrast measure in a local patch, whereas global methods use the
entire image to compute the saliency. Global methods often use a
spatial feature as well, since two pixels or regions that look similar
but are far apart need not belong to the same object. Global methods
fail when the background is complex. Achanta et al. compute saliency
based on each pixel's color difference from the mean image color.
Cheng et al. combine global contrast with spatial differences to
generate a saliency map. Perazzi et al. compute saliency by
decomposing the image into homogeneous elements. Contrast and
spatial distribution are used to obtain pixel-accurate saliency maps.
These are estimated using high-dimensional Gaussian filters.

However, the contrast prior alone is not very effective. The other
most commonly used cue is based on the assumption that most
photographers do not crop the salient object along the view frame.
Hence the image boundary forms the background. However, the
boundary prior is fragile and is prone to fail even when the object
only slightly touches the image boundary. Wei et al. propose a
saliency measure based on the shortest path length between an image
patch and a virtual boundary node; it overcomes the shortcomings of
the boundary prior by connecting the boundary regions to a virtual
node through edges with suitable boundary weights. Yang et al. rank
the similarity of image regions to foreground or background cues
using graph-based manifold ranking. The ranking is based on the
relevance of an element with respect to the given queries. Zhu et al.
propose a boundary connectivity measure that utilizes both the
contrast prior and the boundary prior. The foreground and background
weights obtained are then combined using an optimization framework.

Objectness proposal generation methods propose a small number
of windows that are likely to contain the object in an image, thereby
reducing the search space for classifiers. Alexe et al. propose an
objectness measure that combines, in a Bayesian framework, several
image cues measuring the characteristics of objects. Zhang et al.
propose a cascaded ranking SVM to generate an ordered set of
proposals. Cheng et al. propose a binarized version of normed
gradient features (BING), which can be tested using a few atomic
operations, to generate objectness proposals.

Jiang et al. integrate objectness with uniqueness and focusness
to obtain saliency maps. However, these maps are not smooth,
and it is difficult to attribute the results to specific algorithm
properties.

In this work, rather than obtaining the background from the image
boundary and using the boundary prior, we quickly obtain a rough
estimate of the foreground regions by utilizing a modified version of
the recently proposed objectness proposal technique BING. We then
compute super-pixel objectness, a measure that quantifies how likely
it is for a super-pixel to be part of the foreground. The foreground
and background regions are obtained by appropriately thresholding
this measure. We propose a robust saliency measure called foreground
connectivity, which assigns saliency values to these super-pixels.

Fig. 1. Illustration of the main phases of our algorithm: (a) input
image, (b) thresholded Objectness Map, (c) foreground weights,
(d) saliency map after optimization, (e) ground truth.

Rather than combining the cues heuristically (weighted summation
or multiplication), we use the principled optimization framework
proposed by Zhu et al., which regards saliency as a global
optimization problem. The values assigned to the super-pixels are
used as the foreground weights in the cost function. Minimizing the
cost function results in foreground regions taking higher values and
background regions taking lower values. The obtained maps are also
smooth and uniform due to the smoothness constraint in the cost
function.

We have extensively evaluated our method on the MSRA-1000
and CSSD [20] datasets, and we show that the proposed method
performs on par with or better than the existing methods.

The paper is organized as follows. Section 2 describes the various
modules of the proposed approach, including the new ‘foreground
connectivity’ measure. Experiments and results are discussed in
Section 3, and Section 4 concludes the paper.

2. METHODOLOGY 

The proposed approach is as follows. We build an Objectness Map
using objectness proposals to capture the super-pixels containing the
object. Next, using the foreground connectivity measure, we assign
foreground weights to the super-pixels. We then use the saliency
optimization technique to combine our foreground weights with the
background measure used in that framework to obtain smooth and
accurate saliency maps. Figure 1 illustrates the main phases of our
algorithm.

2.1. Objectness Map 

Objects are stand-alone things with well-defined closed boundaries
and centers. When windows containing objects are resized to
a smaller size, the norm of the image gradients (NG) becomes a good
discriminative feature. These normed gradients are insensitive to
changes in scale, aspect ratio and translation. The fact that objects
share some correlation in the NG space is utilized in BING to
detect objects. The image is resized to a set of fixed sizes, and the
normed gradient values in an 8 × 8 region are used as a
64-dimensional NG feature. These windows are scored with a learned
linear model $w \in \mathbb{R}^{64}$:

$$s_l = \langle w, g_l \rangle \tag{1}$$

$$l = (i, x, y) \tag{2}$$

where $s_l$, $g_l$, $l$, $i$ and $(x, y)$ are the filter score, NG
feature, location, size and position of a window, respectively. Using
non-maximal suppression, the top proposals are chosen and re-ranked
based on their location and size using coefficients learned with a
linear SVM. BING is accurate and extremely fast, as the features are
binarized and only a few atomic operations are required to obtain the
objectness proposals.

In the proposed approach, we adapt BING by modifying the
following modules. Instead of using a learned linear model, we
use an 8 × 8 Laplacian-of-Gaussian-like filter to obtain scores
for the windows. We also skip re-ranking the windows based on
learned coefficients. The proposed filter places higher weights along
the edges and resembles center-surround patterns. With this
model, we also cut down the training time.
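
The window scoring of Eq. (1) with a fixed filter in place of the learned model can be sketched as below. This is only an illustration under stated assumptions: the text does not give the filter coefficients, so the Laplacian-of-Gaussian shape, the value of `sigma`, the zero-mean normalization, and the gradient definition in `normed_gradient` are all hypothetical choices, as are the function names.

```python
import numpy as np

def normed_gradient(gray):
    """Gradient magnitude |gx| + |gy| clipped to [0, 255] (assumed NG definition)."""
    gray = gray.astype(np.float64)
    gy, gx = np.gradient(gray)  # derivatives along rows, then columns
    return np.minimum(np.abs(gx) + np.abs(gy), 255.0)

def log_like_filter(size=8, sigma=2.0):
    """Zero-mean LoG-shaped kernel; size and sigma are illustrative assumptions."""
    ax = np.arange(size) - (size - 1) / 2.0
    xx, yy = np.meshgrid(ax, ax)
    r2 = xx ** 2 + yy ** 2
    # Negative near the center, positive towards the border of the window.
    log = (r2 - 2 * sigma ** 2) / sigma ** 4 * np.exp(-r2 / (2 * sigma ** 2))
    return log - log.mean()

def score_windows(ng, w):
    """Score every window of the NG map with the filter: s_l = <w, g_l>."""
    windows = np.lib.stride_tricks.sliding_window_view(ng, w.shape)
    return np.einsum("ijkl,kl->ij", windows, w)
```

In such a sketch the score map would be computed once per resized copy of the image, and non-maximal suppression would then pick the top windows.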

After obtaining the objectness proposals, we generate the
Objectness Map. The objectness score of a window tells us how likely
it is to contain an object. We use these objectness proposals to obtain
a pixel-wise objectness (PixObj) score, which tells us how likely it is
for a pixel to be part of an object. The pixel-wise objectness score is
given by

$$\mathrm{PixObj}(p) = \sum_{i=1}^{k} s_i \, G_i(x, y) \tag{3}$$

where $s_1, s_2, \ldots, s_k$ are the objectness scores of the k
proposals containing pixel p, $G_i$ is a Gaussian window having the
same dimensions as the i-th proposal, and x and y are the coordinates
of pixel p relative to that proposal.
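
A minimal sketch of Eq. (3) follows. The box format `(x0, y0, x1, y1)` with exclusive upper bounds, the parameter `sigma_frac` that ties the Gaussian spread to the window size, and the function names are assumptions made for illustration; the text only states that each Gaussian window has the same dimensions as its proposal.

```python
import numpy as np

def gaussian_window(h, w, sigma_frac=0.25):
    """Separable Gaussian of shape (h, w), peaking at the window center."""
    ys = np.arange(h) - (h - 1) / 2.0
    xs = np.arange(w) - (w - 1) / 2.0
    gy = np.exp(-ys ** 2 / (2 * (sigma_frac * h) ** 2))
    gx = np.exp(-xs ** 2 / (2 * (sigma_frac * w) ** 2))
    return np.outer(gy, gx)

def pixel_objectness(shape, proposals, scores, sigma_frac=0.25):
    """Accumulate s_i * G_i(x, y) over every proposal that contains each pixel."""
    pix_obj = np.zeros(shape, dtype=np.float64)
    for (x0, y0, x1, y1), s in zip(proposals, scores):
        window = gaussian_window(y1 - y0, x1 - x0, sigma_frac)
        pix_obj[y0:y1, x0:x1] += s * window
    return pix_obj
```

Pixels covered by no proposal keep a score of zero, and pixels near the centers of several high-scoring proposals accumulate the largest values.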

The sum of the pixel-wise objectness scores within a super-pixel
region gives the objectness score of that region (which is used
to construct the Objectness Map):

$$\mathrm{Objectness}(R) = \sum_{p_i \in R} \mathrm{PixObj}(p_i) \tag{4}$$

where $p_i$ is a pixel belonging to super-pixel region R. To obtain
super-pixel regions, we use SLIC, as it is fast and preserves
boundaries.
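
Eq. (4) amounts to a per-label sum, which can be sketched as follows. The sketch assumes that `labels` is an integer map of super-pixel indices starting at 0 (for example, the output of any SLIC implementation); the function name is hypothetical.

```python
import numpy as np

def superpixel_objectness(pix_obj, labels):
    """Sum the pixel-wise objectness inside each super-pixel region (Eq. 4)."""
    n_regions = int(labels.max()) + 1
    return np.bincount(labels.ravel(), weights=pix_obj.ravel(),
                       minlength=n_regions)
```

The returned vector holds one objectness score per super-pixel and is the quantity that is subsequently thresholded.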

We use adaptive thresholding to obtain the Objectness Map. We
observe that in a few cases these Objectness Maps segment out the
salient object completely; however, most Objectness Maps either miss
parts of the object or include parts of the background. This happens
because the objectness proposals are rectangular regions containing
both foreground and background.

2.2. Foreground Connectivity 

Thresholded Objectness Maps roughly capture the super-pixels
that are part of the foreground. This is exemplified in Figure 2.
We propose a novel saliency measure called ‘foreground
connectivity’ that assigns saliency values based on a super-pixel's
connectivity to the estimated foreground. We construct a graph with
super-pixels as nodes. Super-pixels that are adjacent in the image are
connected by an edge whose weight is the Euclidean distance between their
mean LAB values. We now define the foreground connectivity of a
super-pixel R as:

$$\mathrm{FG}(R) = \frac{\sum_{k=1}^{N} d(R, R_k) \, \delta(R_k)}{\sum_{k=1}^{N} d(R, R_k) \, (1 - \delta(R_k))} \tag{5}$$

where $d(R, R_k)$ denotes the shortest distance between R and $R_k$,
$\delta(\cdot)$ is 1 for a super-pixel if it is estimated as foreground
by the Objectness Map and 0 otherwise, and N is the total number of
super-pixels.

A higher similarity of a super-pixel to the estimated foreground
ensures a lower value in the numerator and a higher value in
the denominator, leading to a smaller value of FG (implying higher
connectivity). We take the reciprocal of FG and use it as the
foreground weight.

Fig. 2. Rough estimate of foreground obtained using Objectness
Maps.

Fig. 3. Visual comparison of saliency maps. (a) Original image.
Saliency maps obtained using (b) GS, (c) SF, (d) MR, (e) SO,
(f) the proposed approach, and (g) ground truth. The proposed
approach generates saliency maps that are accurate, smooth and
uniform.

2.3. Saliency Optimization

Fig. 4. Comparison of PR curves on (a) the MSRA-1000 database and
(b) the CSSD-200 database.

Generally, several saliency cues are combined heuristically using
weighted summation or multiplication. Instead, we use the existing
optimization framework of Zhu et al. to combine our foreground
weights with the background weights used there. The cost function to
be minimized is defined as

$$\sum_{i=1}^{N} w_i^{fg} (t_i - 1)^2 + \sum_{i=1}^{N} w_i^{bg} t_i^2 + \sum_{i,j} w_{ij} (t_i - t_j)^2 \tag{6}$$

where $t_i$ denotes the final saliency value assigned to super-pixel
$p_i$ after minimizing the cost, and $w_i^{fg}$ and $w_i^{bg}$ denote
the foreground and background weights associated with super-pixel
$p_i$. A high $w_i^{fg}$ encourages $p_i$ to take a value close to 1,
and a high $w_i^{bg}$ encourages $p_i$ to take a value close to 0.
$w_{ij}$ is the smoothness coefficient. We use the same parameter
settings as the original optimization framework.
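
The foreground weights (the reciprocal of the foreground connectivity) and the minimization of the cost function can be sketched together as below. Several points are assumptions rather than details given in the text: shortest distances are taken as shortest-path lengths on the super-pixel graph (computed here with Floyd-Warshall for brevity), the graph is assumed to be connected with each edge listed once, the smoothness sum is taken over all ordered pairs (i, j) with a symmetric matrix `W`, and the background weights `w_bg` are supplied by the caller. All function names are hypothetical.

```python
import numpy as np

def foreground_weights(edges, mean_lab, is_fg, eps=1e-9):
    """Reciprocal of the foreground connectivity FG(R) for every super-pixel."""
    n = len(mean_lab)
    i, j = edges[:, 0], edges[:, 1]
    # Edge cost: Euclidean distance between mean LAB colors of neighbors.
    cost = np.linalg.norm(mean_lab[i] - mean_lab[j], axis=1)
    d = np.full((n, n), np.inf)
    d[i, j] = cost
    d[j, i] = cost
    np.fill_diagonal(d, 0.0)
    for k in range(n):  # Floyd-Warshall all-pairs shortest paths
        d = np.minimum(d, d[:, k:k + 1] + d[k:k + 1, :])
    to_fg = d[:, is_fg].sum(axis=1)   # numerator: distances to foreground
    to_bg = d[:, ~is_fg].sum(axis=1)  # denominator: distances to background
    fg = to_fg / (to_bg + eps)
    return 1.0 / (fg + eps)

def cost(t, w_fg, w_bg, W):
    """Foreground term + background term + smoothness over ordered pairs."""
    smooth = (W * (t[:, None] - t[None, :]) ** 2).sum()
    return (w_fg * (t - 1) ** 2).sum() + (w_bg * t ** 2).sum() + smooth

def optimize_saliency(w_fg, w_bg, W):
    """Closed-form minimizer: (diag(w_fg + w_bg) + 2 L) t = w_fg, L = D - W."""
    L = np.diag(W.sum(axis=1)) - W
    A = np.diag(w_fg + w_bg) + 2.0 * L
    return np.linalg.solve(A, w_fg)
```

Because the cost is a convex quadratic in t, setting its gradient to zero gives a single linear system, so no iterative solver is needed in this sketch.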

3. RESULTS 

We evaluate the proposed approach on two benchmark datasets. The
first is the MSRA-1000 dataset, which is one of the most
extensively used databases. The second is the CSSD dataset [20].
The images in the MSRA-1000 dataset vary widely in content and
background but are simple and smooth. The CSSD dataset, on the
other hand, has structurally complex images. Results obtained on
both datasets are evaluated against ground-truth masks labeled by
humans. We compare the results of our algorithm with four recent
state-of-the-art methods: Saliency Filter (SF), Geodesic Saliency
(GS), Manifold Ranking (MR) and Saliency Optimization (SO).
Results on MSRA-1000 and CSSD [20] are shown in Figure 3. We
evaluate our method using Precision-Recall curves and Mean
Absolute Error (MAE).

3.1. Precision and Recall 

Precision is the fraction of pixels correctly assigned as salient out
of the total number of pixels assigned as salient, whereas recall is
the fraction of pixels correctly assigned as salient out of the number
of salient pixels in the ground truth. Precision and recall vary
inversely, and hence it is essential to evaluate them simultaneously.
We therefore use a Precision-Recall curve, as in previous works. For
each threshold value in [0, 255], we obtain a binary map and, using
the ground truth mask, compute the precision and recall values.
Figure 4 shows that the proposed approach performs on par with the
other state-of-the-art algorithms. However, precision-recall curves
have a serious limitation: they do not consider the fraction of
pixels correctly assigned as not salient. The presence of pixels
incorrectly assigned as salient brings down the measured performance
of a saliency map even when it is smooth and assigns higher values to
pixels that are salient.
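
The thresholding procedure can be sketched as follows. The sketch assumes a saliency map scaled to integers in [0, 255] and treats pixels with value greater than or equal to the threshold as salient; that comparison rule and the function name are assumptions.

```python
import numpy as np

def precision_recall_curve(saliency, gt, thresholds=range(256)):
    """Precision and recall of the binarized map at each threshold."""
    gt = gt.astype(bool)
    precision, recall = [], []
    for t in thresholds:
        pred = saliency >= t                      # binary map at threshold t
        tp = np.logical_and(pred, gt).sum()       # correctly salient pixels
        precision.append(tp / max(pred.sum(), 1))  # guard against empty maps
        recall.append(tp / max(gt.sum(), 1))
    return np.array(precision), np.array(recall)
```

Neither quantity involves pixels correctly assigned as not salient, which is the limitation noted above.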


3.2. Mean Absolute Error 

To overcome this limitation, we use the Mean Absolute Error (MAE),
as suggested in prior work. It measures how similar a saliency map is
to the ground truth. For a saliency map S and a ground truth mask G,
the MAE is defined as

$$\mathrm{MAE} = \frac{1}{W \times H} \sum_{x=1}^{H} \sum_{y=1}^{W} |S(x, y) - G(x, y)| \tag{7}$$

where H and W denote the height and width of the image. Results
are averaged over all the images in the database. The
proposed approach performs better than the state-of-the-art methods
in terms of MAE (see Table 1).
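
A minimal sketch of the per-image MAE, assuming that both the saliency map and the ground truth mask are scaled to [0, 1] (the function name is hypothetical):

```python
import numpy as np

def mean_absolute_error(saliency, gt):
    """Mean of |S(x, y) - G(x, y)| over all pixels of one image."""
    s = saliency.astype(np.float64)
    g = gt.astype(np.float64)
    return np.abs(s - g).mean()
```

The dataset-level figure would then be the average of this value over all images.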



Method      MSRA     CSSD
GS          0.109    0.178
SF          0.129    0.204
MR          0.085    0.150
SO          0.068    0.136
Proposed    0.064    0.132


Table 1. Comparison of MAE values of different saliency methods 


3.3. Running Time 

The average running time of the proposed approach on an Intel Core
i5-4200U CPU @ 1.60 GHz with 6 GB RAM is 0.27 s, excluding
preprocessing and super-pixel segmentation using SLIC.

4. CONCLUSION 

In this paper, we present a simple and efficient method that utilizes
objectness proposals and a foreground connectivity measure to detect
salient objects in an image. Unlike recent methods, we obtain our
saliency maps by estimating foreground regions in an image instead
of using boundary priors. Our method, combined with the
optimization framework, produces accurate and smooth saliency maps
that perform better than other methods in terms of MAE when tested on
two widely used datasets. In future, we plan to investigate better
cues rather than depending on the contrast or boundary prior alone,
better connectivity measures, and better objectness proposal
techniques that can perform well with backgrounds that are even more
complex.